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Abstract. We present a new algorithm that computes eigenvalues and eigenvectors of a Her- 
mitian positive definite matrix while solving a linear system of equations with Conjugate Gradient 
(CO). Traditionally, all the CG iteration vectors could be saved and recombined through the eigen- 
vectors of the tridiagonal projection matrix, which is equivalent theoretically to unrcstarted Lanczos. 
Our algorithm capitalizes on the iteration vectors produced by CG to update only a small window 
of vectors that approximate the eigenvectors. While this window is restarted in a locally optimal 
way, the CG algorithm for the linear system is unaffected. Yet, in all our experiments, this small 
window converges to the required eigenvectors at a rate identical to unrestarted Lanczos. After the 
solution of the linear system, eigenvectors that have not accurately converged can be improved in 
an incremental fashion by solving additional linear systems. In this case, eigenvectors identified in 
earlier systems can be used to deflate, and thus accelerate, the convergence of subsequent systems. 

We have used this algorithm with excellent results in lattice QCD applications, where hundreds 
of right hand sides may be needed. Specifically, about 70 eigenvectors are obtained to full accuracy 
after solving 24 right hand sides. Deflating these from the large number of subsequent right hand 
sides removes the dreaded critical slowdown, where the conditioning of the matrix increases as the 
quark mass reaches a critical value. Our experiments show almost a constant number of iterations 
for our method, regardless of quark mass, and speedups of 8 over original CG for light quark masses. 

Keywords: Hcrmitian linear systems, multiple right hand sides, eigenvalues, 
deflation, Lanczos, Conjugate Gradient 

1. Introduction. The numerical solution of linear systems of equations of large, 
sparse matrices is central to many scientific and engineering applications. One of the 
most computationally demanding applications is lattice Quantum Chromodynamics 
(QCD) because not only does it involve very large matrix sizes but also requires 
the solution of several linear systems with the same matrix but different right hand 
sides. Direct methods, although attractive for multiple right hand sides, cannot be 
used because of the size of the matrix. Iterative methods provide the only means for 
solving these problems. 

QCD is the theory of the fundamental force known as strong interaction, which 
describes the interactions among quarks, one of the constituents of matter. Lattice 
QCD is the tool for non-perturbative numerical calculations of these interactions on 
a Euclidean space-time lattice [5S]. The heart of the computations is the solution of 
the lattice-Dirac equation, which translates to a linear system of equations Mx = 6, 
often for a large number of right hand sides [18]. The Dirac operator M is 75- 
Hermitian, or 75 Af = M^75, where, in one representation, 75 is a diagonal matrix 
with 1 and -1 on the diagonal. Also, M — niql — D, where niq is a parameter 
related to the quark mass and D is an operator. In addition to solving linear systems 
of equations, many current approaches [IHl HH UHl HI] require the solution of the 
eigenvalue problem {-f^M)^ — XiUi, for 100-200 smallest magnitude eigenvalues, Xi, 
and their eigenvectors, Ui (together we call them eigenpairs). Beyond the very large 
dimension and number of right hand sides, M becomes increasingly ill-conditioned as 
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•niq — > rricriticai- In lattice QCD this is known as critical slowdown and is a limiting 
computational factor. 

Traditionally these linear systems are solved by applying the Conjugate Gradi- 
ent (CG) on the A = M Hermitian operator. Although there are cases where 
BICGSTAB on the nonsymmetric operator can be twice as fast [501 HZ], in other 
cases, such as in domain wall fermions, CG is not only the fastest but also the most 
robust methocj^ In this paper, we focus on CG for Hermitian systems for two reasons. 
First, CG is characterized by optimal convergence. Second, the eigenvectors of A are 
also eigenvectors of 75A/. Hence, after an initial CG phase, computed eigenvectors 
can be used to deflate BICGSTAB on 75 A/ for the rest of the right hand sides. 

Solving a linear system with many right hand sides is an outstanding problem. 
Traditional approaches solve each system one by one. Unknowingly, such methods 
regenerate search directions within previously explored subspaces, thus wasting it- 
erations in successive right hand sides. Sharing information between systems has 
long been recognized as the key idea [3F|. This is typically performed through seed 
methods [4B., X 44 or through block methods [26^ 42j 23] ■ For Hermitian matrices, 
a selective sharing of only the useful part of information between systems can be 
achieved through invariant subspaces. Such ideas have been tried in QCD [8], but ef- 
fective deflation methods have only appeared recently. In section[2]we review deflation 
methods that either precompute the required eigenpairs or use a restarted method 
to compute eigenvectors while solving linear systems. Both approaches, however, are 
unnecessarily expensive. 

In this paper we present an algorithm that computes eigenvalue and eigenvector 
approximations of a Hermitian operator by reusing information from the unrestarted 
CG method. This is achieved by keeping a search space that includes current eigen- 
vector approximations and only the last few CG iteration vectors. The crucial step is 
how we restart this search space to keep computations tractable. The CG iteration is 
completely unaffected. Our experiments show that eigenvector convergence is similar 
to unrestarted Lanczos; an impressive achievement yet to be understood theoretically. 

Our motivating application is the computation of nucleon-nucleon scattering 
lengths with all to all propagators, where, for a time discretization between 64 and 
128, 1500-3000 right hand sides must be solved. Our algorithms are equally applicable 
to the more classic problem of computing nuclcon form factors using sequential prop- 
agators where 120 right hand sides are required. In a first phase, we solve 24 right 
hand sides with our method. Unconverged eigenvectors from one system improve 
incrementally during the solution of subsequent systems. In the second phase, after 
24 right hand sides, enough eigenvectors have been obtained to significantly reduce 
the condition number of the deflated matrix in the classic CG. In this phase we ob- 
serve speedups of 8-9 over the non-deflated CG, and, most importantly, the number of 
the deflated CG iterations remains almost constant as ruq approaches mcriUcai, thus 
removing the critical slowdown. 

2. Background and current approaches. Krylov methods provide one of the 
most important tools for solving large, general sparse linear systems. An excellent 
survey of recent developments and discussion on some of the open problems in linear 
systems appears in |49j . For symmetric (or Hermitian) positive deflnite (SPD) matri- 
ces, CG |30J remains the uncontested choice because it uses a three term recurrence 
to converge optimally, with minimum storage and computational requirements. Even 
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for the Hcrmitian case, however, it remains an open question as to how best solve a 
system with many (say s) right hand sides, 

Axi = bi, i — 1, . . . , s. 

One approach is to use block methods which work simultaneously on a set of vec- 
tors (131121 HZ]- They have favorable performance characteristics in memory hierarchy 
computers and usually reduce the number of iterations. However, their implementa- 
tion is involved as linear dependencies in the block must be removed [19j . More 
importantly, the total execution time often increases and there is no clear theoretical 
understanding of when to use them and with how large a block size (see [25J [59] for 
a recent review and [38J for a discussion of block methods in the context of multiple 
right hand sides and QCD). 

The other common approach is the use of seed methods [48l[7j|44], which reuse the 
Krylov subspace generated by one seed system to project the rest. After projection, 
a new seed system is chosen to iterate to convergence, and the idea is repeated until 
all systems are solved. Seed methods effectively reduce the number of iterations 
to solve each successive right hand side when these are highly related. Most seed 
algorithms do not store the previously generated Krylov spaces. Instead, while solving 
Axi — hi, they project the current Krylov vector from all the approximations Xj,j = 
i -|- 1, . . . ,s, updating all remaining linear systems at every iteration |48l 1^. Seed 
and block methods have also been combined [321. Still, this type of seed methods 
presents three difficulties: First, the hi vectors may not be all available at the same 
time. Second, we consider the contribution of each Krylov vector to all systems. This 
contribution is usually too small to warrant the additional expense, so the total time 
may increase. Third, the second problem becomes extreme when the hi are unrelated. 

The above difficulties can be avoided by noticing that for unrelated hi and for 
SPD matrices the only improvements from seed methods should come almost entirely 
from those common subspaces that a Krylov method builds for any starting vector: 
the extreme invariant subspaces. In particular, the eigenvalues near zero should be 
targeted as their deflation dramatically decreases the condition number of the matrix. 
Also, we would like to avoid using an eigensolver to compute those eigenpairs but to 
reuse the Krylov space built by CG. 

For non-Hermitian matrices, the GMRESDR method [5^ 155] computes approx- 
imate eigenspace information during GMRES(to) |46]. In GMRESDR, when the 
GMRES basis reaches m vectors, it is restarted not only with the residual of the 
approximate solution of the linear system but also with the nev < m smallest magni- 
tude Ritz pairs. It is known that GMRESDR is equivalent to the Implicitly Restarted 
Arnoldi ^50], so while GMRESDR solves the linear system, the nev Ritz pairs also 
converge to the smallest magnitude eigenpairs. This elegant solution transfers also to 
the Hermitian case, where GMRESDR becomes equivalent to thick restarted Lanczos 

For Hermitian systems, however, the GMRESDR approach presents three disad- 
vantages. First, it is based on GMRES which is much more expensive per step than 
CG. Second, restarting GMRES every m steps impairs the optimal convergence of 
CG. Third, restarting also impairs the convergence of unrestarted Lanczos to the nev 
required eigenpairs. In the context of this paper, the latter is an important disad- 
vantage, because the eigenspaces may not be obtained at the accuracy required for 
deflation of other systems. In that case, we would like to incrementally improve on 
the approximate eigenspace during the solution of subsequent linear systems. This is 
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performed in [22^, but only for GMRESDR. In Sectionjsjwe present a way to compute 
eigenpairs from the unrestarted CG. One of the components of our method is similar 
to a method developed independently by Wang et al. |Slj. As we show later, our 
method is not only cheaper but by making the appropriate restarting choices it yields 
practically optimal convergence. 

2.1. Deflating the CG. Once a set of seed vectors U have been computed, 
there are several ways to use U to "deflate" subsequent CG runs. Assume that the 
solution X of the system Ax = b has significant components in U . Given an initial 
guess Xq, we can consider another guess whose U components are removed ahead of 
time. This is performed by the Galerkin oblique projection |45j : 

(2.1) xo = S:Q + U{U"AU)-'^U"{b~~A5:o). 

Then xq is passed as initial guess to CG. These two steps are often called init-CG [TJ 
I23JI17]. The init-CG approach works well when U approximates relatively accurately 
the eigenvectors with eigenvalues closest to zero. 

When U spans an approximate eigenspace that has been computed to e accuracy, 
init-CG converges similarly to deflated CG up to e accuracy for the linear system. 
After that point, convergence plateaus and eventually becomes similar to the original 
CG [221 . A variety of techniques can be employed to solve this problem: using ?7 as a 
"spectral preconditioner" (TBI El HH [5] , including U as an explicit orthogonalization 
constraint in CG 31, 54], or combinations of methods as reviewed in [23 . Typically, 
these techniques result in convergence which is almost identical to exact deflation, 
but the cost per iteration increases significantly. In this paper, we focus only on the 
init-CG method. 

2.2. Locally optimal restarting for eigensolvers. We conclude this back- 
ground section with a restarting technique for eigensolvers that plays a central role in 
the method we develop in this paper. Because the Hermitian eigenvalue problem can 
be considered a constrained quadratic minimization problem, many eigenvalue meth- 
ods have been developed as variants of the non-linear Conjugate Gradient (NLCG) 
method on the Grassman manifold T!^. However, it is natural to consider a method 
that minimizes the Rayleigh quotient on the whole space {u^"^~^\u'-"^\ g^"^")^ , in- 
stead of only along one search direction. By 6'^™-' , u*^™) we denote the eigenvalue 
and eigenvector approximations at the mth step and g*^™^ = Aw'™-' — 5/(™)y(™) the 
corresponding residual. The method: 

(2.2) = RayleighRitz ({u^™"!), , m > 1, 

is often called locally optimal Conjugate Gradient (LOCG) |T2 [SI], and seems to 
consistently outperform other NLCG type methods. For numerical stability, the basis 
can be kept orthonormal, or u^™) — t'^'^^m^™"^^ can be used instead of u^™~'^\ for 
some weight r^™). The latter is the LOBPCG method [35]. 

Quasi-Newton methods use the NLCG vector iterates to construct incrementally 
an approximation to the Hessian, thus accelerating NLCG ^T]. Similarly, if all the it- 
erates of LOCG are considered, certain forms of quasi-Newton methods are equivalent 
to unrestarted Lanczos. With thick or implicit restarting, Lanczos loses this single 
important direction (m^™^^-'). Therefore, the appropriate way to restart Lanczos or 
Lanczos- type methods is by subspace acceleration of the LOCG recurrence [^ I5T]. 
When looking for nev eigenvalues, the idea can be combined with thick restarting 



CG eigenvalue deflation in QCD 



5 



so that the eigensolver search basis is restarted with an orthonormal basis for the 
following vectors [511 153j : 



(2.3) 



(m) (m) (m) ("i— 1) 

^1 J "2 7 • ■ • J ""net; ' ' ■ ■ ■ i "fc 



An efficient implementation is possible at no additional cost to thick restarting, 
because all orthogonalization is performed on the small coefficient vectors of the 
Rayleigh-Ritz procedure [51] . This technique consistently yields convergence which is 
almost indistinguishable from the unrest arted method. We see next how this scheme 
can help us approximate eigenvectors from within CG. 

3. The eigCG method. The idea is to use an eigenvector search space V within 



the CG iteration, which is restarted through (2.3 1, but not to let it dictate the next 
search direction. Instead, we leverage a window of the last m residuals computed by 
the unrestarted CG to build an appropriately restarted subspace. 

Consider the preconditioned CG algorithm in the left box of Figure |3.1[ We will 
exploit the equivalence of CG and Lanczos and the fact that the Lanczos vectors 
are the appropriately normalized CG residuals [25]. In the general case of an SPD 
preconditioner, P ^ I, the Lanczos method approximates eigenpairs of the symmetric 
matrix A = p-V2^p-i/2 f^^^ ^j^g Lanczos basis P~^/'^R = P^'^/'^[ri/ pi, . . . , rj/pj], 

where pj — (rj Zj)^!"^ and Zj = P~^l'^ry Let tj = P~^/'^RY and A be the Lanczos 
approximations to the eigenvectors and eigenvalues of A, respectively. Thes e ar e 



exactly the eigenpairs needed to deflate subsequent linear systems with A using (2.1 1. 
The inconvenient U basis can be avoided because CG needs an initial guess xq ~ x 
and not xq « P^/'^x. Specifically, let V = P~^R = [zi/ pi, . . . , zj / pj], let io be 
some approximation to x, and P^^'^xq the corresponding initial guess for the split 
prec onditioned system A. If xq is the deflated initial guess for system A by applying 



(2.1) on P^/^xq, then the initial guess xq to be given to CG is: 



xo = P'^/^xo = p-i/2(pi/2£^ + uk-^lj"p-^/\p-^/% - p-^'^Ai^)) 
= io + P^^RYA-^Y" R" p-\b - Aio) 
= io + VYA-^Y"v"ib - AS:q). 

The Y are the eigenvectors corresponding to the eigenvalues A of the Lanczos tridiag- 
onal matrix T„ ^ V"AV = (p-^/'^R)" A{p-^/^R). The is obtained at no extra 
cost from the scalars aj = r^_iZj / {p^ Apj) and /3j — r^-iZj / {r^_2Zj-i) computed 
during the CG iteration [ISl, p. 194]: 



(3.1) T„, = 



l/ai \/]h/oii 



m 

\/Pm+l/ am 1 /am + Pm / 



-1 



Traditionally, the Lanczos Ritz values and vectors are computed from T,n at the 
end of CG. This, however, requires the storage of all Zj or a second CG run to recom- 
pute them on the fly, effectively doubling the cost. Moreover, if the number of CG 
iterations is large, dealing with spurious eigenvalues at the end is expensive. Instead, 
we introduce an algorithm that restarts the search space for computing eigenvalues 
but does not restart CG. 
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The proposed algorithm, eigCG, adds new functionahty to CG as shown in Figure 
|3.1| in a Matlab-like format. It uses a set of m vectors, V, to keep track of the lowest 
nev Ritz vectors (or more accurately the P~^/^U) of the matrix A. Initially, V is 
composed of the normalized CG preconditioned residua ls z/ Vr^z (step 11.13), and 
the projection matrix T„i — AV is available from (3.1) (steps 11.1 and 11.10). 
When V reaches m vectors, we restart it exactly as we would restart an eigensolver 
using (2.3 1 with k = nev. The Rayleigh Ritz can be applied both on T„i and T^-i 
to produce Ritz vector coefficients for the last two consecutive steps (step 11.3). We 
also append a zero row to Y (step 11.4), so that Y and Y have the same dimension 
TO. In steps 11.5, 11.6 we use the Rayleigh Ritz again to compute an orthonormal 
Ritz basis for the space spanned by [Y, Y]. In step 11.7 the restarted basis V and its 
corresponding diagonal projection matrix are computed. Notice V remains orthonor- 
mal (or P-orthonormal in case of preconditioning). After restarting, eigCG continues 
to append the CG preconditioned residuals z to V, and to extend Ti with the same 
tridiagonal coefficients. The only exception is the first residual added after restart. It 
requires a set of inner products, z^ AV, to explicitly update the i + 1 row of matrix 
T. As shown in step 11.8, a matrix vector product can be avoided if we remember the 
previous vector tprev (step 10.1) and note that Az = Ap — (ijtprev After this point, 
no other extra computation is needed until the next restart (to — 2nev iterations). 



The CG algorithm 

r^b - Ax; j ^0 
while ||r||/||ro|| > tol 

i - J + 1 

z = p-^r 

Pprev P ^ ^ ^ 

if U == 1) 

p — z 
else 

l^j — p/ Pprev 

p = z + 13 jp 
end 
t = Ap 

aj = r^z/{p"t) 

X = X + UjP 

r = r 
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The eigCG(nei', to) additions to CG 
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Solve for nev lowest eigenpairs of 
T„,Y = YM and T^_iF_= YM 
Add an mth row of zeros: Y — [Y;0] 
Q=orth([yr]), set H = Q^TmQ 
Solve for the eigenpairs of HZ = ZM 
Restart: 

V = V{QZ), i = 2nev, T, = M 
Compute zfAV, the i + 1 row of T^+i 

W t Pjtpj-Q^ 

else 
end 

i = i + l 
V{:,i) = zl^p 



Fig. 3.1. The eigCG(nev,m) algorithm approximates nev eigenvalues keeping a search space of 
size m > 2nev. The classical CG algorithm is shown in the left box. To obtain the eigCG algorithm 
we extend the CG steps 0, 10, and 11, with new steps numbered with decimal points. The right box 
shows these additions to CG. 



Computationally, eigCG requires storage for to vectors, but no additional matrix- 
vector operations or other costs during the CG iterations except at restart. At restart, 
all operations that involve matrices of size m, which includes the orthogonalization at 
step 11.5, have negligible cost. The only two expenses are: step 11.7 which requires 
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0{2N ■ m ■ 2nev) flops for V{QZ), and step 11.8 which requires 0{2N ■ 2nev) flops for 
AV . This expense occurs every (to — 2nev) iterations, for a total of 

0{4:N ■ nev ■ (to + 1)/(to — 2nev)) flops per iteration. 

Interestingly, the average cost per step decreases with to and for large enough to > 
lOnev it approaches 0{4:Nnev) flops per step. This is the same as the computation 
of 2nev Ritz vectors from the full basis of unrestarted Lanczos. Finally, the extra 
computations are fully parallelizable: First, the computation V{QZ) incurs no syn- 
chronization and can be performed in a cache efficient way with level 3 BLAS routines. 
Second, the accumulation of the dot products V^Az can be delayed until the first 
dot product during the next CG iteration to avoid an extra synchronization. 

3.1. Convergence and comparison with other methods. In exact arith- 
metic, the vectors in V remain orthonormal (or P-orthonormal in case of precondi- 
tioning) as linear combinations of the Lanczos vectors. In fioating point arithmetic, 
the Lanczos property guarantees Zj _L V and the numerical accuracy of the Tm coeffi- 
cients until some eigenpairs converge to square root of machine precision (^emachll^lD- 
After that, spurious eigenvalues start to appear in T„i but without compromising the 
accuracy of the correct ones. Converging but unwanted eigenpairs, lying in the high 
end of the spectrum, pose little stability threat because they are purged at every 
restart (step 11.7). Nevertheless, we should choose to such that the highest eigenpair 
does not converge in m — 2nev iterations, i.e., between successive restarts. Unless the 
largest eigenvalues are highly separated, this is not usually an issue. 

To avoid spurious wanted eigenvalues, our eigCG algorithm facilitates a straight- 
forward implementation of selective orthogonalization and a cheaper version of par- 
tial orthogonalization |43j. For any Ritz vector Vy, where y is any column of Y, its 
residual norm can be monitored at no additional cost through the Lanczos property, 
IjAFy — HiVyW — \(3j\\ym\- When V is restarted we can selectively orthogonalize 
against all those Ritz vectors whose accuracy approaches ^tmach ■ Alternatively, we 
can simply orthogonalize the CG residual against all the restarted V . The additional 
expense is minimal and the benefits are twofold; spurious eigenvalues are avoided in 
and the convergence of CG improves as well. We do not further explore this approach 
in this paper for three reasons; First, to keep the presentation simple and focused 
on the main eigCG idea; Second, as shown in section |4] we use eigCG incrementally 
which resolves spurious issues at a higher level; Third, in our QCD application code 
we use single precision for all computations except for dot products. Therefore we do 
not expect any spurious eigenvalues to show up while solving one linear system. 

The structure of T,„ after restart is reminiscent of the thick restarted Lanczos 
(TRLAN) [5^. However, after the first restart, the residual of TRLAN (equivalently 
of ARPACK [371 or of GMRESDR is the residual produced by 2nev steps of some 
other Lanczos process with a different initial vector. Our residuals continue to he the 
Lanczos vectors of the original CG/Lanczos process. Therefore, the convergence of the 
CG is unaffected. Also, eigenvalue convergence in eigCG is expected to be different 
from TRLAN or GMRESDR. 

Recently, the recycled MINRES (RMINRES) algorithm was developed indepen- 
dently in |54|- It is based on the same technique of reusing some of the MINRES 
residuals in a basis V . The critical difference with eigCG is that RMINRES restarts 
as in TRLAN by keeping only the nev harmonic Ritz vectors closest to zero and not 
the previous directions (i.e., fc in (|2.3[)). As a result, eigenvalues converge only to 
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Residual convergence of 8 smallest eigenpairs with eigCG(8,20) 
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Fig. 3.2. Left graph: eigCG convergence with various values of (nev,m), matching unrestarted 
Lanczos. Right graph: convergence of 8 smallest eigenpairs of eigCG(8,20). 



a very low accuracy and then stagnate. This is acceptable in [54j because their appli- 
cation involves systems of slightly varying matrices which cannot be deflated exactly 
by each other's eigenvectors. Moreover, RMINRES tries to identify and maintain 
directions that tend to repeat across Krylov subspaces of different linear systems. To 
be effective, the basis V must be kept orthogonal to these vectors, thus increasing 
the expense of its MINRES iteration. Our eigCG focuses on getting the eigenvectors 
accurately which can later be deflated inexpensively with init-CG. 

The use of Rayleigh Ritz and thick restart guarantee monotonic convergence of 
the nev Ritz values from V . Beyond that, it is not obvious why eigCG should converge 
to any eigenpairs, let alone with a convergence rate identical to that of unrestarted 
Lanczos! With thick restarting alone (as in RMINRES) the important information 
that is discarded cannot be reintroduced in 1/ as future residuals of the unrestarted 



CG are orthogonal to it. By using the restarting of eq. (2.3), with modest nev values, 
almost all Lanczos information regarding the nev eigenpairs is kept in condensed form 
in V . Then, the new CG residuals do in fact represent the steepest descent for our 
Ritz vectors, and eigCG behaves like unrestarted Lanczos. 

The left graph of Figure shows residual convergence for the smallest eigenpair 
under various eigCG(ri,ew, to) runs on a matrix of size 12 x 8** = 49152 that repre- 
sents the spectrum of a typical Wilson fermion matrix with light quark mass. The 
eigCG(l,3) holds only three vectors in V , which cannot capture the information well, 
and stagnates. Keeping as few as three Ritz vectors (eigCG(3,9)) improves the stagna- 
tion point dramatically, while with eight Ritz vectors we were able to reach accuracy 
of lc-12. The figure only shows convergence until step 1540 which is where the lin- 
ear system converged. It is remarkable, however, that until it reaches the stagnation 
point eigCG converges at the rate of unrestarted, fully reorthogonalized Lanczos. We 
have observed the same property for all 8 smallest eigenvalues, whose convergence is 
depicted in the right graph in Figure |3.2[ A theoretical analysis of this surprising 
behavior will be the focus of our future research. 



Figure 3.3 compares the effectiveness of eigCG in approximating many eigenpairs 
to those of the competing GMRESDR and RecycledCG methods (the latter imple- 
menting the RMINRES ideas on CG). The experiments are run on a Wilson Fermion 
lattice of size 12x12* with periodic boundary conditions and quark mass equal to the 
critical mass. We use odd-even preconditioning (which yields a matrix of half the size) 
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Fig. 3.3. Accuracy of 10 smallest eigenvalues computed by eigCG(10,50), GMRESDR(55,34), 
and RecycledCG(50,30) methods, at 830, 1000, and 1100 iterations. For each method and case, 
smallest eigenvalues correspond to leftmost bars. 



and focus on the symmetric normal equations. We run the experiments in Matlab 7.5, 
on a Mac Pro, dual processor, dual-core, 3 GHz Intel Xeon. We test eigCG(10,50), 
GMRESDR(55,34), and RecycledCG(50,30) so that all methods have similar memory 
requirements. We plot the residual norms of the 10 smallest eigenvalues computed 
by the three methods at 830, 1000, and 1100 matrix-vector products. GMRESDR is 
effective at computing the lowest three eigenvalues but the accuracy of the rest does 
not seem to improve with further iterations. RecycledCG cannot improve the low 
accuracy obtained early during the iteration. Our method not only computes more 
accurate and more eigenvalues than both other methods, it also continues to improve 
them with more iterations. Moreover, GMRESDR took 1168 iterations and 1505 sec- 
onds to solve the linear system to accuracy 10~®, while eigCG took 1010 iterations 
and 410 seconds. Thus, eigCG was 3.7 times faster, while also computing a much 
better eigenspace. 

For comparison, we also report some results from running the GMRESDR directly 
on the nonsymmetric Dirac operator. This is sometimes reported to be twice as 
fast as CG on the normal equations because it avoids the squaring of the condition 
number. However, such results are often obtained from heavier quark masses, where 
the problem is not as difficult or interesting. In our 12 x 12'^ problem, and with the 
same accuracy and computational platform, GMRESDR(55,34) took 1000 iterations 
and 1120 seconds. GMRESDR(85,60) improved convergence to 585 iterations, but 
its execution time was still 1001 seconds. For low quark masses, as in our case, 
GMRESDR is not only slower than eigCG, but it has more difficulty computing 
accurate eigenpairs even with more memory as shown in Figure |3.4[ 



4. Incrementally increasing eigenvector accuracy and number. When 
the CG iteration converges before equally accurate eigenvectors can be obtained, 
these eigenvectors cannot deflate effectively the CG for the next right hand side. 



Deflating with the eigenvectors obtained in the example of Figure 3.2 yielded 10% 
faster convergence in the next linear system. As we mentioned earlier, using the 
resulting ^ as a spectral preconditioner is much more effective, resulting in 50% 
faster convergence in the same example. Still, we would like to avoid this expense on 
every step of CG, and more importantly to obtain more eigenvectors of A for more 
effective deflation. 
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Fig. 3.4. Accuracy of 10 smallest magnitude eigenvalues computed by GMRESDR(55,34) 
and GMRESDR(85,60), at 585, 1000, and 2200 iterations on the nonsymmetric operator. GM- 
RESDR(85,60) is not reported at 2200 MV since it reached machine precision much earlier. 



A simple outer scheme can be designed that calls eigCG for Si < s right hand 
sides and accumulates the resulting nev approximate Ritz vectors from each run into 
a larger set, U. 



The Incremental eigCG algorithm 


U=[],H=[] 




% accumulated Ritz vectors 


for i = 1 : s\ 




% for si initial rhs 


xo = UH-^U"b 


i 


% the init-CG part 


[xi,V,M] = eig 


CG{nev,m,A,xo, 


bi) % eigCG with initial guess xq 


V = orthonormalize V against U 


% (Not strictly necessary) 




H U"W ■ 


% Add nev rows to H 


W = AV,H = 




Set U = [U, V] 




% Augment U 


end 







If we assume that U contains converged eigenvectors, the Ritz vec;tors prochicred in V 
during the solution of the next linear system will be in the orthogonal complement 
of span(?7), as xq is deflated of U. If some of the vectors in U are not converged 
enough, the eigCG pro(hic;es V with directions not only in new eigenvectors but also 
in directions that complement the unconverged Ritz vectors in U. Thus, the accuracy 
of U incrementally improves up to machine precision as our experiments indicate. 

Although it is advisable for computational reasons to keep m large, nev should 
not be too large because it increases the eigCG cost without providing more than a few 
good eigenpair approximations from one linear system. Computing many eigenpairs 
through the Incremental eigCG allows us to choose modest values for nev. In our 
experiments, we used nev = 10 while larger values did not consistently improve the 
results. Other dynamic techniques for setting nev based on monitoring eigenvalue 
separation and convergence are also possible but not explored in this paper. 

Computationally, if the number of vectors in [/ is /, the above algorithm costs nev 
matrix-vector operations and (Alnev+Anev^) N +2lnevN flops of level 3 BLAS for each 
new system solved. For very large I this time could become noticeable, in which case 
we could use the following two optimizations. First, we note that orthogonalization 
of the new vectors is only performed to reduce the conditioning of H, so it does 
not need to be carried out accurately. Moreover, it would suffice to orthogonalize 
only those Ritz vectors in V with large components in U. These can be identified 
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by whether their Ritz values he within the range of aheady computed eigenvalues. 
Second, we could lock accurate eigenvectors out of U so that they do not participate 
in building the H matrix. Naturally, if the number of right hand sides is large enough, 
s ^ si, any of the above costs are amortized by the much faster convergence of the 
deflated systems for bi, i > si. Nevertheless, our timings show that the overhead of 
the incremental part is small even for 240 vectors. 

Storage for the total of I — sinev vectors of U is usually not a challenge because 
it is on the order of the right hand sides or the solutions for the s linear systems 
(assuming s > sinev). Moreover, U is not accessed within eigCG or CG, so it can 
be kept in secondary storage if needed. Still, it would be beneficial computationally 
and storage- wise if we could limit the size of U to only a certain number of smallest 
eigenvectors. We have observed that if after augmenting U with V, we truncate U to 
include only a certain number of smallest eigenvectors, the accuracy ceases to increase 
with V from subsequent systems. Although this problem merits further investigation, 
in the QCD problems in which we are currently interested the number of right hand 
sides is large enough to allow us to grow [/ to a large number. 

5. Numerical experiments. In this section we present results from two real 
world applications. After describing the computational platform and applications, 
we address four issues. In section |5.2[ we study convergence of eigenvalues in the 
Incremental eigCG. In section [53] we study the improvement in convergence of the 
linear systems as bigger spaces are deflated in init-CG. In section |5.4| we study the 
convergence invariance of the resulting deflated init-CG under various quark masses. 
Finally, in section |5.5| we provide timings that show the small overhead and high 
efficiency of the method on a supercomputing platform. 

5.1. Chroma implementation and two lattice QCD tests. We imple- 
mented our algorithms in C and interfaced with Chroma, a lattice QCD C+-I- software 
base developed at Jefferson National Lab [T5]. All our experiments were done in sin- 
gle precision with dot product summations in double precision. In the following three 
sections, we report tests on an 8 node dual socket, dual core cluster with 4GB of 
memory per node. The timings in section [5.5| are reported from a production all-to- 
all propagator calculation on 256 processors of the Cray XT4 at NERSC, Lawerence 
Berkeley National Lab. 

The Dirac matrices used in this paper come from two ensembles of an anisotropic, 
two flavor dynamical Wilson fermion calculation. In both cases the sea pion mass is 
about 400(36) MeV and the lattice spacing is 0.108(7)fm. The anisotropy factor of 
the time direction is about 3 and the critical mass was determined to be -0.4188 in 
all cases. We studied the behavior of our algorithm on several different lattices from 
these ensembles and found insigniflcant variation in performance. In one case the 
lattice size is 16^ x 64 for a matrix size of iV = 3, 145, 728, while in the other it is 
243 X 64 for a matrix size of iV = 10, 616, 832. We refer to these cases as 3M and lOM 
lattices, respectively. The odd-even preconditioned normal equations were solved in 
both cases with storage requirements about lOOiV per matrix. These ensembles are 
used in current lattice QCD calculations with more than one hundred right hand sides 
at Jefferson Lab, hence our algorithmic improvements have direct implications in the 
cost of currently pursued lattice QCD projects. 

In Figure [STT] we present the lowest part of the spectrum for the matrices resulting 
from the two test lattices and for a range of quark masses. As expected, the lowest 
eigenvalue becomes smaller as the quark mass approaches the critical quark mass (- 
0.4188) leading to large condition numbers and to the critical slowdown. However, 
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Case: 16 x64. Lowest 80 eigenvalues for various masses 
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Fig. 5.1. The smallest eigenvalues for the M positive definite matrix under various quark 
masses. 80 smallest eigenvalues are shown for the 3M lattice (left graph) and 50 smallest eigenvalues 
for the lOM lattice (right graph). 



the more interior an eigenvalue is the lower the rate that it decreases with the mass. 
This also is expected, because the largest eigenvalue and the average density of the 
spectrum is a function of the discretization volume and not as much of the quark mass 
[3]. Therefore, by deflating a sufficient number of lowest eigenvalues not only do we 
reduce the condition number significantly, but we also make it practically constant 
regardless of quark mass, thus removing the critical slowdown. 

For the 3M lattice, for example, deflating 12 vectors from the lightest masses yields 
a condition number similar to the heavy mass case -0.4000. For the lOM lattice, about 
30 vectors are required for the same effect. These examples show the limitation of all 
deflation methods, not only eigCG. As eigenvalue density increases with volume, more 
eigenvalues need to be deflated to achieve a constant condition number. Traditionally, 
scalabihty across volumes is the topic of multigrid methods [U El |6j 1^ . Although, 
our methods are very effective in that direction too, they only guarantee constant 
time across different masses. 

5.2. Eigenvalue convergence in Incremental eigCG. In this section, we 
show how Incremental eigCG improves partially converged eigenpairs and how it 
produces additional interior eigenpairs. Figure |5.2| shows the convergence of certain 
eigenpairs at every outer iteration of the Incremental eigCG for each of the two lattices 
and for three quark masses. In all cases, we use eigCG(10,100) to solve 24 unrelated 
right hand sides. After eigCG converges to a system, the computed 10 Ritz vectors 
augment U . We explicitly perform a Rayleigh Ritz on U to report the best eigenvector 
approximations. We show the convergence of every 10th Ritz pair from the step 
they were first produced by eigCG and after each outer iteration. For example, the 
convergence history of the 30th smallest eigenpair is the third curve from the bottom 
in the graphs. The curve first appears at outer step 3 and improves after the solution 
of each subsequent system. In all cases, eigenpair approximations continue to converge 
and more eigenpairs are calculated incrementally with more outer iterations. 

The top graphs in Figure |5.2| which correspond to a very heavy mass and thus a 
small condition number, show that linear systems converge faster than eigenvectors. 
Incremental eigCG requires the solution of 10 right hand sides for the 3M lattice 
and 18 right hand sides for the lOM lattice to achieve machine precision for the first 
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10 eigenpairs (first bottom curve). One could continue the first eigCG until enough 
eigenvectors converge, but this would not take advantage of these iterations to solve 
linear systems. Moreover, deflating only a couple of eigenvalues is sufficient for low 
condition numbers. As the condition number deteriorates with lighter quark masses 
(middle and bottom graphs) , eigCG takes far more iterations and thus can obtain the 
smallest 10 eigenvalues to machine precision by solving less than five linear systems. 
This behavior is consistent with the distribution of the eigenvalues in Figure |5.1[ 

Incremental eigCG finds about the same number of extremal eigenvalues per 
number of linear systems solved, with a gradual decrease in the rate for more interior 



eigenpairs, especially with lighter masses (see Figure 5.2 1. The former is expected 
because eigCG builds a different space for each right hand side, so Incremental eigCG 
has no information for interior eigenvalues until they are targeted. There are three 
reasons for the gradual rate decrease. First, the graphs in the figure are shown against 
the number of linear systems solved so the scale is not representative of the actual 
work spent to find these eigenvalues. Indeed, later systems are deflated with more 
vectors in U and so take fewer iterations to solve. Second, deflation does not improve 
the relative gaps between unconverged eigenvalues, so with fewer iterations eigCG 
cannot recover as much eigen-information by solving one linear system. Third, as 



discussed in section 2.1 for init-CG, the convergence of interior eigenvalues plateaus 
when it reaches the accuracy of more extreme deflated eigenvalues. Then, eigCG tries 
to improve the already computed eigenvectors rather than compute new ones. Loss 
of orthogonality in CG can also contribute to this. In this paper, we do not seek to 
alleviate this last problem through spectral preconditioning or by orthogonalization 
as in RMINRES, because the additional cost for obtaining more interior eigenvalues 
may not be justified as the exterior eigenvalues determine the condition number to 
a greater extent. Even so, we show how much can be achieved with no substantial 
additional cost to CG. 



We conclude this study of eigenvalue convergence by showing in Figure 5.3 the 
residual norms of the 240 computed Ritz vectors after all 24 linear systems have been 
solved with Incremental eigCG. The graphs show that for heavy masses more interior 
eigenvalues are found to better accuracy than with lighter masses, but with lighter 
masses extreme eigenvalues are found to much better accuracy. This is particularly 
evident in the lOM lattice. We have found that deflating Ritz vectors with residual 
norm below ^Jtmach is more effective. Thus, about 80-110 eigenpairs can be deflated 
for both lattices, except for the lOM lattice with a mass close to critical one. In that 
case , a spectrally preconditioned eigCG could further improve interior eigenvalues. 

5.3. Linear system convergence with init-CG. Figure |5.4| shows the CG 
convergence history for solving three linear systems for each of the two lattices, each 
system having 48 right hand sides. For the flrst 24 right hand sides we use Incremen- 
tal eigCG(10,100), and for the 24 subsequent systems we use init-CG deflated with 
the obtained 240 approximate eigenvectors. Therefore, CG convergence is the slow- 
est for the first system without deflation and improves as groups of 10 Ritz vectors 
are accumulated by Incremental eigCG. As expected from the eigenvalue spectrum 
of these matrices, there are diminishing returns from deflating an increasing number 
of eigenvectors. However, these diminishing returns start approximately at the point 
where the smallest non-deflated eigenvalue becomes relatively invariant of the quark 
mass used. In Figure |5.4| the init-CG used for the flnal 24 vectors converges in ap- 
proximately the same number of steps regardless of quark mass, yielding speedups of 
more than 8 in the most difficult cases. 
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Fig. 5.2. Convergence of eigenpairs in the outer loop of Incremental eigCG( 10,100) . A total 
of 24 linear systems are solved, one per outer iteration. Each curve shows the convergence of the 
residual norm of the (10 X i)-th innermost Ritz vector, which is obtained during the i-th outer 
iteration and improved in iterations j + 1, . . . ,24. Typically, residual norms for eigenvalues (10 X 
i) + 1, . . . , (10 X i) + 9 fall between the two depicted curves for (10 X i) and (10 X (i 1)). Left 
graphs show results from the 3M lattice and right graphs from the lOM lattice. Matrices coming 
from three different masses are considered for each lattice; a very heavy mass (top), the sea-quark 
mass (middle), and a mass close to the critical mass (bottom). Slower CG convergence with lighter 
masses allows eigenvalues to be found faster. 
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Fig. 5.3. Residual norm of the 240 Ritz vectors computed at the end of Incremental eigCG 
on 24 right hand sides. Left graph shows results from the 3M lattice and right graph from the lOM 
lattice. The three curves correspond to three different quark masses (a heavy, the sea quark, and 
one close to the critical mass). 



The Incremental eigCG curves for the first 24 systems show a sublinear conver- 
gence behavior which begins at the accuracy at which deflated approximate eigenvec- 
tors were obtained. Instead of a plateau, however, we see a gradual deterioration of 
the rate of convergence because extreme eigenpairs, which have a bigger effect on the 
condition number, are obtained more accurately than interior ones. For simplicity, we 
did not address this problem for the first 24 right hand sides during the Incremental 
eigCG, but only for the 24 subsequent right hand sides to be solved with init-CG. 
In many applications, including ours, the number of subsequent systems is large so 
optimizing the initial Incremental eigCG phase may not have a large impact. 

For the second phase, we restart the init-CG when the norm of the linear system 
residual reached within an order of magnitude of machine precision (scaled by the 
norm of the matrix). As seen in the graphs of the previous section, most defiation 
benefits come from Ritz vectors with residual norm below this threshold. The graphs 
in Figure |5.4| show that after a short lived peak, the restart completely restores the 
linear CG convergence. A dynamic way to choose the restart threshold can be devised 
based on the computed eigenvalues and their residual norms, and balancing the bene- 
fits of reducing the condition number with the expense of restarting. Such a technique 
goes beyond the scope of this paper. 

5.4. Removing the QCD critical slowdown. We have run similar experi- 
ments with 48 right hand sides for both the 3M and lOM lattices on several matrices 
coming from a wide range of quark masses; from heavy to below critical. In Figure 
|5.5| we show how large an eigenvalue we can expect the init-CG algorithm to defiate 
and how well. This eigenvalue is the denominator of the condition number of the de- 
fiated operator. We consider three thresholds lE-3, lE-4, and lE-5 and plot for each 
matrix the largest eigenvalue returned by Incremental eigCG that has residual norm 
less than these three thresholds. For the 3M lattice (left graph) we see that lighter 
masses allow our algorithm to find eigenvalues deeper in the spectrum very accurately. 
For threshold lE-3, the eigenvalues identified by Incremental eigCG are very close to 
0.009 for all physically meaningful masses. Therefore, we expect similar conditioning 



and number of iterations regardless of the mass. This is confirmed in Figure 5.6 We 
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Fig. 5.4. Linear system convergence solving 40 unrelated right hand sides. Left graphs show 
results from the 3M lattice and right graphs from the lOM lattice. Matrices coming from three dif- 
ferent masses are considered for each lattice; a very heavy mass (top), the sea-quark mass (middle), 
and a mass close to the critical mass (bottom). Incremental eigCG(10,100) is used for the first 24 
systems. The final 240 approximate eigenvectors are used in init-CG to solve the rest 24 systems. 
Restarting CG close to single machine precision resolves any convergence delays associated with 
init- GG. 
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Fig. 5.5. For each quark mass and its corresponding matrix we plot the largest eigenvalue of 
the 240 computed by Incremental eigCG that has residual norm less than some threshold. Each of 
the three curves corresponds to a different threshold. 



also note the weakening ability of Incremental eigCG to identify interior eigenvalues 
below the critical mass because of loss of orthogonality in eigCG/init-CG. Similar 
observations can be made for the lOM lattice, with the exception that the eigenvalues 
that are accurate to lE-3 tend to be smaller near the critical mass. Still, the ratio 
between the lE-3 accurate eigenvalues at masses -0.4112 (sea quark mass) and -0.4180 
(critical) is less than 4, implying a slowdown of no more than two over heavier masses. 
This is confirmed in Figure [STH] 



Case: 16 x 1 6 x 16 x 64 x 12 



O Original CG 
* init-CG with 240 evecs 



sea quark mass 



5.425 -0.42 -0.415 -0.41 
Quark mass 



-0.405 -0.4 



Case: 24 x 24 x 24 x 64x 12 



O Original CG 
* init-CG with 240 evecs 



sea quark mass 



'Oo 



Oo 



o 



* ** ** * ****** * 

-0.415 -0.41 -0.405 -0.4 

Quark mass 



Fig. 5.6. For each quark mass and its corresponding matrix we plot the average number of 
iterations required by init-CG to solve the rest 24 systems. For comparison the number of iterations 
of the non-deflated CG is reported. 

Figure |5.6| shows the average number of iterations required by init-CG to solve 
the 24 right hand sides for the two lattices and for each mass. We also plot the 
number of iterations required by the non-deflated CG. Speedups close to an order of 
magnitude are observed and, more importantly, the number of iterations of init-CG 
is almost constant across meaningful masses. Again, we note that a more fastidious 
use of spectrally preconditioned eigCG would have resulted in further reduction in 
iterations, especially for the lOM lattice, but this reduction would have been far less 
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Table 5.1 

Execution times in seconds for phase 1 where we apply Incremental eigCG, for phase 2 where 
we switch to init-CG, and for the whole application that includes also about 90 seconds of application 
setup time. We also show the times from using the original CO in Chroma without deflation. 



Time for first 24 rhs 


Time for next 232 rhs 


Total apphcation time 


Chroma CG 527.9 
Incr. eigCG 323.2 


Chroma CG 5127.2 
init-CG 951.2 


Chroma CG 5751.5 
Incr. eigCG 1365.8 


speedup 1.6 


speedup 5.4 


overall speedup 4.2 



substantial relative to those reported in this paper. Moreover, this would only be 
needed in the physically non-meaningful range of masses (3M lattice) or very close 
to the critical mass (lOM lattice). Instead, we showed why the critical slowdown can 
be removed in principle when the number of right hand sides is large and derived an 
algorithm that achieves this. 

5.5. Cray XT4 timings. We have run the lOM lattice on the Cray XT4 at 
NERSC for a real world, all-to-all propagator calculation p3] with time and spatial 
even-odd dilution. Two different random noise vectors were used for a total of 256 
right hand sides. This is a stochastic method of estimating all matrix elements of 
the quark propagator (i.e. the Dirac matrix inverse). For details regarding all-to-all 
propagator calculations see [32]. The quark mass used was niq = —0.4125 which is the 
same as the dynamical quark mass used to generate the gauge configurations. The 
corresponding pion mass was determined to be roughly 400MeV and the spatial box 
was about 2.6fm. These are typical parameters in current lattice QCD calculations, 
although lighter masses would clearly benefit our methods. Our codes are compiled 
with C++ using the -02 and loop unrolling flags. 

As in our previous experiments we run two phases: First we apply Incremental 
eigCG(10,100) on the first 24 right hand sides. Second, on the following 232 systems, 
we apply init-CG deflated by the accumulated 240 vectors in U . We also restart the 
init-CG as in the previous sections, thereby incurring the deflation cost twice. 

Table [STT] shows the execution times for the two phases and the overall time for the 
application. We also report times for the original CG code as implemented in Chroma. 
All codes use the Chroma implementation of the sparse matrix-vector multiplication. 
The native CG in Chroma implements various architectural optimizations such as 
SSE vector processing and hand loop unrolling. Our CG implementation achieves an 
iteration cost only about 2% more expensive than the native Chroma CG. However, 
the benefits of deflation are far more significant. Linear systems deflated with the 240 
vectors are solved 5.4 times faster than regular CG. Moreover, while obtaining these 
240 vectors, our algorithm is still 1.6 times faster than if we were to simply run CG. 
Overall, the application runs with a speedup of 4.2. 

Next, we quantify the relative expense of the various components in our algorithm; 
specifically, the relative costs of deflating with U , of incrementally updating U outside 
eigCG, and of updating V inside eigCG. In Figure |5.7[ the left graph shows the 
percentage breakdown of execution time among eigCG, the incremental update, and 
the deflation part of init-CG. We show these only for the 24 vectors in the flrst phase 
where Incremental eigCG is used. 

Our first observation is that the initial deflation part is negligible. Even deflating 
230 vectors (before solving the 24th right hand side) constitutes less than 0.5% of 
the time spent in Incremental eigCG. In the second phase, when deflation with 240 
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Fig. 5.7. Left graph: Percentage breakdown of execution time for the three components of 
Incremental eigCG: the eigCG inner iteration, the initial deflation, and the incremental update. 
Right graph: The solid line shows the execution time in seconds of Incremental eigCG for the k-th 
right hand side, k = 1, . . . , 24. The bars below each point show the time it would take to solve a 
linear system by switching to init-CG at the k-th right hand side. 



vectors occurs twice and when the cheaper CG (i.e., eigCG(0,0)) is used, the total 
expense of deflation is less than 3% of the time to solve a system. 

The cost of the incremental update of U increases linearly with the number of 
right hand sides. As shown in Figure |5.7| the cost of updating 230 vectors with 10 
additional ones (the 24th step) constitutes about 15% of the total Incremental eigCG 
time. This is a result of re-orthogonalizing the new vectors V against U and also of 
the fact that the Incremental eigCG is about 2.5 times faster than the original eigCG, 
so the relative cost of updates is pronounced. Lastly, in our application, updating V 
during eigCG(10,100) costs an additional 21% over simple CG. Clearly, these relative 
costs depend on the cost of the matrix vector operation, and, for the general case, 
one must refer to the computational models in Sections [3] and |4] 

Finally, we study the scenario where Incremental eigCG is run on fewer right hand 
sides, hence accumulating fewer deflation vectors for phase two. The right graph in 



Figure 5.7 shows two curves. The points on the solid line show the time taken by 
Incremental eigCG for the fc-th right hand side, k = 1, . . . , 24. Immediately below 
each point, the bar represents the time init-CG would take to solve a linear system 
deflating only the 10(A: — 1) vectors accumulated by Incremental eigCG up to the 
{k — l)-th right hand side. The lower runtime is due to three reasons: We avoid 
the eigCG-related costs, the cost of updating U, and last but most important, the 



restarting of init-CG avoids the plateaus shown in Figure 5.4 speeding the method 
by a factor of two (when k = 24) . 

Figure |5.7| also shows that the most significant speedups are obtained by running 
Incremental eigCG for 8-9 right hand sides (deflating 80-90 vectors). After that 
improvements wane but still manage to add an additional factor of 1.6 by k = 24. 
It is possible to monitor these improvements at runtime, thus running Incremental 
eigCG for the number of right hand sides that will minimize overall execution time, 
before switching to init-CG. For example, the following table presents the optimal k 



for this problem under various numbers of total rig 


;ht hand sides: 




total number of RHS 
optimal # RHS in Incremental eigCG 


6 12 24 32 48 
3 7 9 11 19 
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Clearly, the more right hand sides we need to solve, the more vectors it pays to solve 
with Incremental eigCG. 

6. Conclusions. The numerical solution of large linear systems with multiple 

right hand sides is becoming increasingly important in many applications. Our origi- 
nal goal was to address this problem in the context of lattice QCD where, in certain 
problems, hundreds of linear systems of equations must be solved. For general SPD 
matrices, we have argued that extreme invariant subspaces are the only useful infor- 
mation that can be shared between Krylov spaces built by different, unrelated initial 
vectors. We have also argued that the critical slowdown, observed in lattice QCD 
computations when the quark mass approaches a critical value, is caused by a de- 
crease in exactly the same extreme (smallest) eigenvalues, while the average density 
of more interior eigenvalues remains unaffected. In our approach we take advantage 
of the many right hand sides by incrementally building eigenspace information while 
solving linear systems. This eigenspace is used to accelerate by deflation subsequent 
linear systems and thus remove the critical slowdown. 

The algorithm we have developed that derives eigenspace information during the 

CG method distinguishes itself from other deflation methods in several ways. First, 
we do not use restarted methods, such as GMRES(m), so our linear system solver 
maintains the optimal convergence of CG. Second, by using the readily available CG 
iterates, we build a local window of Lanczos vectors with minimal additional expense. 
Third, we use the locally optimal restarting technique to keep the size of the window 
bounded. Our resulting algorithm, eigCG, has the remarkable property that the Ritz 
pairs converge identically to the unrestarted Lanczos method, to very good accuracy 
and without having to store the Lanczos vectors. In our experiments, we were able 
to find 50-80 eigenpairs to machine precision by solving 24 linear systems. Current 
state-of-the-art eigenvalue eigensolvers would require the equivalent of 50-80 linear 
system solves to produce the same information. 

We believe the proposed eigCG is a breakthrough method. Because it is purely 

algebraic, it goes beyond lattice QCD to any SPD problem with multiple right hand 
sides. Moreover, it does not require the right hand sides to be available at the same 
time, so it is ideal for time dependent problems. In this paper, we have left some 
questions unanswered (especially those relating to the theoretical understanding of 
the method) and pointed to many directions that eigCG can be improved. Among 
this wealth of future research, a particularly exciting direction is a new eigensolver 
that redefines the state-of-the-art in the area. Finally, a general purpose code is 
currently under development. 
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